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Abstract 

Symmetric positive semi-definite (SPSD) matrix approximation methods have been 
extensively used to speed up large-scale eigenvalue computation and kernel learning 
methods. The standard sketch based method, which we call the prototype model, produces 
relatively accurate approximations, but is inefficient on large square matrices. The Nystrom 
method is highly efficient, but can only achieve low accuracy. In this paper we propose a 
novel model that we call the fast SPSD matrix approximation model. The fast model is 
nearly as efficient as the Nystrom method and as accurate as the prototype model. We 
show that the fast model can potentially solve eigenvalue problems and kernel learning 
problems in linear time with respect to the matrix size n to achieve 1 -I- e relative-error, 
whereas both the prototype model and the Nystrom method cost at least quadratic time 
to attain comparable error bound. Empirical comparisons among the prototype model, 
the Nystrom method, and our fast model demonstrate the superiority of the fast model. 
We also contribute new understandings of the Nystrom method. The Nystrom method is 
a special instance of our fast model and is approximation to the prototype model. Our 
technique can be straightforwardly applied to make the CUR matrix decomposition more 
efficiently computed without much affecting the accuracy. 

Keywords: Kernel approximation, matrix factorization, the Nystrom method, CUR 

matrix decomposition 


1. Introduction 

With limited computational and storage resource, machine-precision inversion and decom¬ 
positions of large and dense matrix are prohibitive. In the past decade matrix approximation 
techniques have been extensively studied by the theoretical computer science community 
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(Woodruff, 2014), the machine learning community (Mahoney, 2011), and the numerical 
linear algebra community (Halko et al., 2011). 

In machine learning, many graph analysis techniques and kernel methods require 
expensive matrix computations on symmetric matrices. The truncated eigenvalue decom¬ 
position (that is to find a few eigenvectors corresponding to the greatest eigenvalues) 
is widely used in graph analysis such as spectral clustering, link prediction in social 
networks (Shin et al., 2012), graph matching (Patro and Kingsford, 2012), etc. Kernel 
methods (Scholkopf and Smola, 2002) such as kernel PCA and manifold learning require the 
truncated eigenvalue decomposition. Some other kernel methods such as Gaussian process 
regression/classification require solving n x n matrix inversion, where n is the number of 
training samples. The rank k {k n) truncated eigenvalue decomposition (/c-eigenvalue 
decomposition for short) of an n x n matrix costs time 0(n^A:)^; the matrix inversion costs 
time 0(n^). Thus, the standard matrix computation approaches are infeasible when n is 
large. 

For kernel methods, we are typically given n data samples of dimension d, while the 
n X n kernel matrix K is unknown beforehand and should be computed. This adds to the 
additional 0{n^d) time cost. When n and d are both large, computing the kernel matrix 
is prohibitively expensive. Thus, a good kernel approximation method should avoid the 
computation of the entire kernel matrix. 

Typical SPSD matrix approximation methods speed up matrix computation by effi¬ 
ciently forming a low-rank decomposition K ~ CUC^ where C G is a sketch of 

K (e.g., randomly sampled c columns of K) and U G can be computed in different 
ways. With such a low-rank approximation at hand, it takes only 0{nc^) additional time to 
approximately compute the rank k {k < c) eigenvalue decomposition or the matrix inversion. 
Therefore, if C and U are obtained in linear time (w.r.t. n) and c is independent of n, then 
the aforementioned eigenvalue decomposition and matrix inversion can be approximately 
solved in linear time. 

The Nystrom method is perhaps the most widely used kernel approximation method. 
Let P be an re X c sketching matrix such as uniform sampling (Williams and Seeger, 2001; 
Gittens, 2011), adaptive sampling (Kumar et al., 2012), leverage score sampling (Gittens 
and Mahoney, 2016), etc. The Nystrom method computes C by C = KP G and U 
by U = (P^C)’t' G This way of computing U is very efficient, but it incurs relatively 

large approximation error even if C is a good sketch of K. As a result, the Nystrom method 
is reported to have low approximation accuracy in real-world applications (Dai et ah, 2014; 
Hsieh et al., 2014; Si et ah, 2014b). In fact, the Nystrom is impossible to attain 1 -f e 
bound relative to ||K — Kfc|||, unless c > D(Y^re/c/e) (Wang and Zhang, 2013). Here 
denotes the best rank-/c approximation of K. The requirement that c grows at least linearly 
with ^/n is a very pessimistic result. It implies that in order to attain 1 -|- e relative-error 
bound, the time cost of the Nystrom method is of order rec^ = D(re^A:/e) for solving the 
/c-eigenvalue decomposition or matrix inversion, which is quadratic in re. Therefore, under 
the 1 -h e relative-error requirement, the Nystrom method is not a linear time method. 

The main reason for the low accuracy of the Nystrom method is due to the way that the 
U matrix is calculated. In fact, much higher accuracy can be obtained if U is calculated 

1. The O notation hides the logarithm factors. 
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by solving the minimization problem minu ||K — CUC^|||n, which is a standard way to 
approximate symmetric matrices (Halko et ah, 2011; Gittens and Mahoney, 2016; Wang and 
Zhang, 2013; Wang et al., 2016). This is the randomized SVD for symmetric matrices (Halko 
et ah, 2011). Wang et al. (2016) called this approach the prototype model and provided 
an algorithm that samples c = 0{kje) columns of K to form C such that minu l|K — 
CUC'^|||n < (1 + e)||K — Kfc|||.. Unlike the Nystrom method, the prototype model does not 
require c to grow with n. The downside of the prototype model is the high computational 
cost. It requires the full observation of K and O(n^c) time to compute U. Therefore when 
applied to kernel approximation, the time cost cannot be less than 0{n?d + 'n?c). To reduce 
the computational cost, this paper considers the problem of efficient calculation of U with 
fixed C while achieving an accuracy comparable to the prototype model. 

More specihcally, the key question we try to answer in this paper can be described as 
follows. 

Question 1 For any fixed n x n symmetric matrix K, target rank k, and parameter 7 , 
assume that 

Al We are given a sketch matrix C € o/K, which is obtained in time Time{C); 

A2 The matrix C is a good sketch of K in that minu l!K ~ CUC^|||. < (1 + 7 )|jK — Kfe|j|.. 

Then we would like to know whether for an arbitrary e, it is possible to compute C and 
U such that the following two requirements are satisfied: 

R1 The matrix U has the following error hound: 

||K - CUC^III, < (1 + e)(l + 7 )||K - Kfe|||. 

R2 The procedure of computing C and U and approximately solving the aforementioned k- 
eigenvalue decomposition or the matrix inversion run in time 0{n ■ poly(fc, 7 “^, e“^)) + 
Time(C). 

Unfortunately, the following theorem shows that neither the Nystrom method nor the 
prototype model enjoys such desirable properties. We prove the theorem in Appendix B. 

Theorem 1 Neither the Nystrom method nor the prototype model satisfies the two 
requirements in Question 1. To make requirement R1 hold, both the Nystrom method and 
the prototype model cost time no less than 0(n^ ■ poly(/c, 7 “^, e“^)) + Time{C) which is at 
least quadratic in n. 

In this paper we give an affirmative answer to the above question. In particular, it has 
the following consequences. First, the overall approximation has high accuracy in the sense 
that II K — CUC^|||. is comparable to minu II K — CUC^|||., and is thereby comparable to 
the best rank k approximation. Second, with C at hand, the matrix U is obtained efficiently 
(linear in n). Third, with C and U at hand, it takes extra time which is also linear in n 
to compute the aforementioned eigenvalue decomposition or linear system. Therefore, with 
a good C, we can use linear time to obtain desired U matrix such that the accuracy is 
comparable to the best possible low-rank approximation. 

The CUR matrix decomposition (Mahoney and Drineas, 2009) is closely related to the 
prototype model and troubled by the same computational problem. The CUR matrix 
decomposition is an extension of the prototype model from symmetric matrices to general 
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matrices. Given any mxn fixed matrix A, the CUR matrix decomposition selects c columns 
of A to form C G and r rows of A to form R G and computes matrix U G 

such that IIA — CUR|||. is small. Traditionally, it costs time 

0{mn ■ min{c, r}) 

to compute the optimal U* = ARl (Stewart, 1999; Wang and Zhang, 2013; Boutsidis and 
Woodruff, 2014). How to efficiently compute a high-quality U matrix for CUR is unsolved. 

1.1 Main Results 

This work is motivated by an intrinsic connection between the Nystrom method and the 
prototype model. Based on a generalization of this observation, we propose the fast SPSD 
matrix approximation model for approximating any symmetric matrix. We show that the 
fast model satisfies the requirements in Question 1. Given n data points of dimension d, 
the fast model computes C and and approximately solves the truncated eigenvalue 

decomposition or matrix inversion in time 

0{n(? je -b n(?dle) + Time(C). 

Here Time(C) is dehned in Question 1. 

The fast SPSD matrix approximation model achieves the desired properties in Question 1 
by solving minu ||K — CUC^Hi;’ approximately rather than exactly while ensuring 

||K - CUf'""*C^||| < (1 + e) nun ||K - CUC^|||. 

The time complexity for computing is linear in n, which is far less than the time 

complexity 0{n^c) of the prototype model. Our method also avoids computing the entire 
kernel matrix K; instead, it computes a block of K of size x which is substantially 
smaller than n x n. The lower bound in Theorem 7 indicates that the ^/n factor here is 
optimal, but the dependence on c and e are suboptimal and can be potentially improved. 

This paper provides a new perspective on the Nystrom method. We show that, as well as 
our fast model, the Nystrom method is approximate solution to the problem minu ||CUC^— 
K|||,. Unfortunately, the approximation is so rough that the quality of the Nystrom method 
is low. 

Our method can also be applied to improve the CUR matrix decomposition of the general 
matrices which are not necessarily square. Given any matrices A G C G and 

R G it costs time 0{mn ■ min{c, r}) to compute the matrix U = C^AR^. Applying 

our technique, the time cost drops to only 

0(yCre~^ ■ min{m, n} • min{c, r}), 

while the approximation quality is nearly the same. 

1.2 Paper Organization 

The remainder of this paper is organized as follows. Section 2 defines the notation used 
in this paper. Section 3 introduces the related work of matrix sketching and SPSD matrix 
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Table 1: A summary of the notation. 


Notation 

Description 

n 

number of data points 

d 

dimension of the data point 

K 

nxn kernel matrix 

P, S 

sketching matrices 

C 

n X c sketch computed by C = KP 

U* 

CfK(Cl)^ G —the U matrix of the prototype model 

Xjnys 

(P^K)I G —the U matrix of the Nystrom method 

IQfast 

(S^C)I(S^KS)(C^S)1 G —the U matrix of the fast model 


approximation. Section 4 describes our fast model and analyze the time complexity and 
error bound. Section 5 applies the technique of the fast model to compute the CUR matrix 
decomposition more efficiently. Section 6 conducts empirical comparisons to show the effect 
of the U matrix. The proofs of the theorems are in the appendix. 

2. Notation 

The notation used in this paper are defined as follows. Let [n] = n}, be the 

nxn identity matrix, and In be the n x 1 vector of all ones. We let a: G y ± 2 ; denote 
y — z<x<y + z. For an mxn matrix A = [A^], we let aj, be its i-th row, Si-j be its 
j-th column, nnz(A) be the number of nonzero entries of A, HAHi? = j be its 

Frobenius norm, and ||A ||2 = maxx^o || Ax|| 2 /||x ||2 be its spectral norm. 

Let p = rank(A). The condensed singular value decomposition (SVD) of A is defined 
as 

A = USV^ = 

i=l 

where ui,--- ,ar are the positive singular values in the descending order. We also use 
(Tj(A) to denote the i-th largest singular value of A. Unless otherwise specihed, in this 
paper “SVD” means the condensed SVD. Let A^ = XliLi CjUjV?’ be the top k principal 
components of A for any positive integer k less than p. In fact, Ak is the closest to A 
among all the rank k matrices. Let A^ = be the Moore-Penrose inverse of A. 

Assume that p = rank(A) < n. The column leverage scores of A are li = ||vj;||| for 
i = 1 to n. Obviously, h + ■■■ + In = P- The column coherence is defined by zz(A) = 
^ maxjg[„] ||vj,|| 2 . If /9 = rank(A) < m, the row leverage scores and coherence are similarly 
defined. The row leverage scores are ||ui:|| 2 , • • • , ||um ,;||2 and the row coherence is p{A) = 

m |L, ||2 

p Il^«:|l2* 

We also list some frequently used notation in Table 1. Given the decomposition K = 
CUC^ ~ K which has rank at most c, it takes O(nc^) time to compute the eigenvalue 
decomposition of K and O(nc^) time to solve the linear system (K + aln)w = y to obtain 
w (see Appendix A for more discussions). The truncated eigenvalue decomposition and 
linear system are the bottleneck of many kernel methods, and thus an accurate and efficient 
low-rank approximation can help to accelerate the computation of kernel learning. 
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3. Related Work 

In Section 3.1 we introduce matrix sketching. In Section 3.2 we describe two SPSD matrix 
approximation methods. 


3.1 Matrix Sketching 

Popular matrix sketching methods include uniform sampling, leverage score sampling 
(Drincas et ah, 2006, 2008; Woodruff, 2014), Gaussian projection (Johnson and Linden- 
strauss, 1984), subsampled randomized Hadamard transform (SRHT) (Drineas et ah, 2011; 
Lu et ah, 2013; Tropp, 2011), count sketch (Charikar et ah, 2004; Clarkson and Woodruff, 
2013; Meng and Mahoney, 2013; Nelson and Nguyen, 2013; Pham and Pagh, 2013; Thorup 
and Zhang, 2012; Weinberger et ah, 2009), etc. 


3.1.1 Column Sampling 


Let pi,--- ,Pn £ (0)1) with = 1 be the sampling probabilities. Let each integer 

in [n] be independently sampled with probabilities spi,-- - ^spn-, where s G [n] is integer. 
Assume that s integers are sampled from [n]. Let ii,--- ,is denote the selected integers, 
and let E[s] = s. We scale each selected column by , • • • , ^ lp ._ ; respectively. Uniform 

sampling means that the sampling probabilities are pi = • • • = pn = Leverage score 
sampling means that the sampling probabilities are proportional to the leverage scores 
/i, • • • , Zn of a certain matrix. 

We can equivalently characterize column selection by the matrix S G Each column 

of S has exactly one nonzero entry; let {ij,j) be the position of the nonzero entry in the 
j-th column for j G [s]. For j = 1 to s, we set 


= 




( 1 ) 


The expectation E[s] equals to s, and s = 0(s) with high probability. For the sake of 
simplicity and clarity, in the rest of this paper we will not distinguish s and s. 


3.1.2 Random Projection 

Let G G be a standard Gaussian matrix, namely each entry is sampled independently 
from AA(0,1). The matrix S = ;^G is a Gaussian projection matrix. Gaussian projection is 
also well known as the Johnson-Lindenstrauss (JL) transform (Johnson and Lindenstrauss, 
1984); its theoretical property is well established. It takes 0{mns) time to apply S G 
to any m x n dense matrix, which makes Gaussian projection inefficient. 

The subsampled randomized Hadamard transform (SRHT) is usually a more efficient 
alternative of Gaussian projection. Let H^, G be the Walsh-Hadamard matrix with 

+1 and —1 entries, D G be a diagonal matrix with diagonal entries sampled uniformly 

from {+1, —1}, and P G be the uniform sampling matrix defined above. The matrix 
S = ^DH„P G is an SRHT matrix, and it can be applied to any m x n matrix in 

0{mnlogs) time. 

Count sketch stems from the data stream literature (Charikar et ah, 2004; Thorup and 
Zhang, 2012) and has been applied to speedup matrix computation. The count sketch 


6 






Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition 


matrix S G can be applied to any matrix A in 0(nnz(A)) time where nnz denotes 

the number of non-zero entries. The readers can refer to (Woodruff, 2014) for detailed 
descriptions of count sketch. 

3.1.3 Theories 

The following lemma shows important properties of the matrix sketching methods. In the 
lemma, leverage score sampling means that the sampling probabilities are proportional to 
the row leverage scores of the column orthogonal matrix U G (Here U is different 

from the notation elsewhere in the paper.) We prove the lemma in Appendix C. 

Lemma 2 Let U G be any fixed matrix with orthonormal columns and B G be 

any fixed matrix. Let S G be any sketching matrix considered in this section; the order 


of s (with the O-notation omitted) is listed in Table 2. Then 


P| U^SS'^U - Ifc 2 > r/| < di 

(Property 1), 

P{||U'^B-U^SS^B||^ > e||B|||| < 82 

(Property 2), 

pjllu'^B-U^SS^B ||2 > e'||B||^ + |||B|||| < ^3 

(Property 3). 


Table 2; The leverage score sampling is w.r.t. the row leverage scores of U. For uniform 
sampling, the notation /r(U) G [l,u] is the row coherence of U. 


Sketching 

Property 1 

Property 2 

Property 3 

Leverage Sampling 


k 

€(52 

— 

Uniform Sampling 

<5i 

At(U)fc 

£<52 

— 

Gaussian Projection 

A:+log(l/(5i) 

^2 

k 

€(52 


SRHT 

log 5^; 

/c+log n 
£<52 

i(t + logSf)log| 

Count Sketch 

Jp_ 

<5i r)2 

k 

£<52 

— 


Property 1 is known as the subspace embedding property (Woodruff, 2014). It shows 
that all the singular values of S^U are close to one. Properties 2 and 3 show that sketching 
preserves the multiplication of a row orthogonal matrix and an arbitrary matrix. 

For the SPSD/CUR matrix approximation problems, the three properties are all we need 
to capture the randomness in the sketching methods. Leverage score sampling, uniform 
sampling, and count sketch do not enjoy Property 3, but it is fine— Frobenius norm 
(Property 2) will be used as a loose upper bound on the spectral norm (Property 3). 
Gaussian projection and SRHT satisfy all the three properties; when applied to the 
SPSD/CUR problems, their error bounds are stronger than the leverage score sampling, 
uniform sampling, and count sketch. 
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3.2 SPSD Matrix Approximation Models 

We first describe the prototype model and the Nystrom method, which are most relevant 
to this work. We then introduce several other SPSD matrix approximation methods. 

3.2.1 Most Relevant Work 

Given an n x n matrix K and an n x c sketching matrix P, we let C = KP and W = 
P^C = P^KP. The prototype model (Wang and Zhang, 2013) is defined by 

^proto A CU*C^ = CCtK(Cl')^C'^, (2) 

and the Nystrom method is defined by 

j^nys A = CW^C^ 

= C(P^C)^(P^KP)(C'^P)'^C^. (3) 

The only difference between the two models is their U matrices, and the difference leads to 
big difference in their approximation accuracies. Wang and Zhang (2013) provided a lower 
error bound of the Nystrom method, which shows that no algorithm can select less than 
^l{-\/nk/e) columns of K to form C such that 

||K - < (1 + e)||K - Kfcll^. 

In contrast, the prototype model can attain the 1 + e relative-error bound with c = Oikje) 
(Wang et ah, 2016), which is optimal up to a constant factor. 

While we have mainly discussed the time complexity of kernel approximation in the 
previous sections, the memory cost is often a more important issue in large scale problems 
due to the limitation of computer memory. The Nystrom method and the prototype 
model require 0{nc) memory to hold C and U to approximately solve the aforementioned 
eigenvalue decomposition or the linear system.^ Therefore, we hope to make c as small as 
possible while achieving a low approximation error. There are two elements: (1) a good 
sketch C = KP, and (2) a high-quality U matrix. We focus on the latter in this paper. 

3.2.2 Less Relevant Work 

We note that there are many other kernel approximation approaches in the literature. 
However, these approaches do not directly address the issue we consider here, so they are 
complementary to our work. These studies are either less effective or inherently rely on the 
Nystrom method. 

The Nystrom-like models such as MEKA (Si et ah, 2014a) and the ensemble Nystrom 
method (Kumar et ah, 2012) are reported to significantly outperform the Nystrom method 
in terms of approximation accuracy, but their key components are still the Nystrom method 
and the component can be replaced by any other methods such as the method studied in 
this work. The spectral shifting Nystrom method (Wang et ah, 2014) also outperforms the 

2. The memory costs of the prototype model is 0{nc + nd) rather than 0{n^). This is because we can hold 
the n X d data matrix and the c x n matrix in memory, compute a small block of K each time, and 
then compute C^K block by block. 
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Nystrom method in certain situations, but the spectral shifting strategy can be used for 
any other kernel approximation models beyond the prototype model. We do not compare 
with these methods in this paper because MEKA, the ensemble Nystrom method, and the 
spectral shifting Nystrom method can all be improved if we replace the underlying Nystrom 
method or the prototype model by the new method developed here. 

The column-based low-rank approximation model (Kumar et ah, 2009) is another SPSD 
matrix approximation approach different from the Nystrom-like methods. Let P G 
be any sketching matrix and C = KP. The column-based model approximates K by 
= (CC^)^/^. Equivalently, it approximates by 

K^K PS CC^ = K^PP^K. 

From Lemma 2 we can see that it is a typical sketch based approximation to the matrix 
multiplication. Unfortunately, the approximate matrix multiplication is effective only when 
K has much more rows than columns, which is not true for the kernel matrix. The column- 
based model does not have good error bound and is not empirically as good as the Nystrom 
method (Kumar et ah, 2009). 

The random feature mapping (Rahimi and Recht, 2007) is a family of kernel approxima¬ 
tion methods. Each random feature mapping method is applicable to certain kernel rather 
than arbitrary SPSD matrix. Furthermore, they are known to be noticeably less effective 
than the Nystrom method (Yang et ah, 2012). 

4. The Fast SPSD Matrix Approximation Model 

In Section 4.1 we present the motivation behind the fast model. In Section 4.2 we provide 
an alternative perspective on our fast model and the Nystrom method by formulating them 
as approximate solutions to an optimization problem. In Section 4.3 we analyze the error 
bound of the fast model. Theorem 3 is the main theorem, which shows that in terms 
of the Frobenius norm approximation, the fast model is almost as good as the prototype 
model. In Section 4.4 we describe the implementation of the fast model and analyze the 
time complexity. In Section 4.5 we give some implementation details that help to improve 
the approximation quality. In Section 4.6 we show that our fast model exactly recovers K 
under certain conditions, and we provide a lower error bound of the fast model. 

4.1 Motivation 

Let P G be sketching matrix and C = KP G The fast SPSD matrix 

approximation model is defined by 

Kif = C(S^C)'^(S^KS)(C^S)^C^, 
where S is n x s sketching matrix. 

From (2) and (3) we can see that the Nystrom method is a special case of the fast model 
where S is defined as P and that the prototype model is a special case where S is defined 
as 1,2 • 

The fast model allows us to trade off the accuracy and the computational cost—larger 
s leads to higher accuracy and higher time cost, and vice versa. Setting s as small as c 
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Table 3: Summary of the time cost of the models for computing the U matrices and the 
number of entries of K required to be observed in order to compute the U matrices. 
As for the fast model, assume that S is column selection matrix. The notation is 
defined previously in Table 1. 



Time 

^Entries 

Nystrom 

O(c^) 

nc 

Prototype 

0(nnz(K)c + nc^) 

V? 

Fast 

0{nc^ + s^c) 

nc+ {s — c)^ 



Nystrom Prototype Fast 

Figure 1: The ■'ollow blc.h.. denote the submatrices of K that must be seen by the kernel 
approximation models. The Nystrom method computes an n x c block of K, 
provided that P is column selection matrix; the prototype model computes the 
entire nxn matrix K; the fast model computes an nxc block and an (s—c) x (s—c) 
block of K (due to the symmetry of K), provided that P and S are column 
selection matrices. 


sacrihces too much accuracy, whereas setting s as large as n is unnecessarily expensive. 
Later on, we will show that s = 0{c-\/nJe) <C n is a good choice. The setting s n 
makes the fast model much cheaper to compute than the prototype model. When applied 
to kernel methods, the fast model avoids computing the entire kernel matrix. We summarize 
the time complexities of the three matrix approximation methods in Table 3; the middle 
column lists the time cost for computing the U matrices given C and K; the right column 
lists the number of entry of K which much be observed. We show a very intuitive comparison 
in Figure 1. 

4.2 Optimization Perspective 

With the sketch C = KP G at hand, we want to hnd the U matrix such that 

CUC^ ~ K. It is very intuitive to solve the following problem to make the approximation 
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tight: 

U* = argminllCUC^-Klip = CtK(C'^)'^. (4) 

u 

This is the prototype model. Since solving this system is time expensive, we propose to 
draw a sketching matrix S G and solve the following problem instead: 

= argmin||S^(CUC^-K)S||p 

u 

= argmin||(S^C)U(S^C)^-S^KS||p 

u 

= (S^C)t(S^KS)(C^S)t, (5) 

which results in the fast model. Similar ideas have been exploited to efficiently solve the 
least squares regression problem (Drineas et ah, 2006, 2011; Clarkson and Woodruff, 2013), 
but their analysis can not be directly applied to the more complicated system (5). 

This approximate linear system interpretation offers a new perspective on the Nystrom 
method. The U matrix of the Nystrom method is in fact an approximate solution to the 
problem minu ||CUC^ — K|||.. The Nystrom method uses S = P as the sketching matrix, 
which leads to the solution 

= argmin||P'^(CUC'^ - K)P||p = (P'^KP)^ = . 

u 


4.3 Error Analysis 

Let correspond to the fast model (5). Any of the five sketching methods in Lemma 2 
can be used to compute although column selection is more useful than random 

projection in this application. In the following we show that is nearly as good as 

XJ* in terms of the objective function value. The proof is in Appendix D. 

Theorem 3 (Main Resnlt) Let K be any n x n fixed symmetric matrix, C he any n x c 
fixed matrix, kc = rank(C), and be the cxc matrix defined in (5). Let S G he any 

of the five sketching matrices defined in Table 4- Assume that = o(re) or e~^ = o{n/c). 
The inequality 

||K-< (1 + e) imn||K-CUC'^llJ, (6) 

holds with probability at least 0.8. 

In the theorem, Gaussian projection and SRHT require smaller sketch size than the other 
three methods. It is because Gaussian projection and SRHT enjoys all of Properties 1, 2, 
3 in Lemma 2, whereas leverage score sampling, uniform sampling, and count sketch does 
not enjoy Property 3. 

Remark 4 Wang et al. (2016) showed that there exists an algorithm (though not linear¬ 
time algorithm) attaining the error bound 

||K-CCtK(Ct)^C'^||^ < (l + e)||K-Kfc||5, 
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Table 4: Leverage score sampling means sampling according to the row leverage scores of 
C. For uniform sampling, the parameter /r(C) G [1,?^] is the row coherence of C. 


Sketching 

Order of s 

Assumption 

^sketch 

^Entries 

Leverage Score Sampling 


e = o{n) 

0(nc^ + s^) 

nc+ {s — c)^ 

Uniform Sampling 

li{C)c^/n/e 

e = o{n) 

0{s^) 

nc+ {s — c)^ 

Gaussian Projection 

/I(c + log?) 

e = o{n/c) 

Cl(nnz(K)s) 

n? 

SRHT 

^^(c + logn)log(n) 

e = ofnjc) 

Olr? log s) 


Count Sketch 

c^/n/e 

e = o(n) 

C)(nnz(K)) 



Algorithm 1 The Fast SPSD Matrix Approximation Model. 

1: Input: an n X n symmetric matrix K and the number of selected columns or target dimension 
of projection c (< n). 

2: Sketching: C = KP using an arbitrary n x c sketching matrix P (not studied in this work); 

3: Optional: replace C by any orthonormal bases of the columns of C; 

4: Compute another n x s sketching matrix S, e.g. the leverage score sampling in Algorithm 2; 

5: Compute the sketches S'^C S and S'^KS G 
6: Compute = (S'^C)^(S^KS)(C'^S)t e 
7: Output: C and such that K « 


with high probability by sampling c = 0{kje) columns of K to form C. Let C G 
be formed by this algorithm and S G be the leverage score sampling matrix. With 

c = Oikje) and s = the fast model satisfies 

||K - CU/“*C^||p < (l + e)||K-Kfc||^ 


with high probability. 

4.4 Algorithm and Time Complexity 

We describe the whole procedure of the fast model in Algorithm 1, where S G can be 
one of the five sketching matrices described in Table 4. Given C and (the whole or a part 
of) K, it takes time 

O(s^c) + Tsketch 

to compute where Tsketch is the time cost of forming the sketches S^C and S^KS 

and is described in Table 4. In Table 4 we also show the number of entries of K that must 
be observed. From Table 4 we can see that column selection is much more efficient than 
random projection, and column selection does not require the full observation of K. 

We are particularly interested in the column selection matrix S corresponding to the row 
leverage scores of C. The leverage score sampling described in Algorithm 2 can be efficiently 
performed. Using the leverage score sampling, it takes time 0(n(?je) (excluding the time 
of computing C = KP) to compute For the kernel approximation problem, suppose 

that we are given n data points of dimension d and that the kernel matrix K is unknown 
beforehand. Then it takes 0{n(?dje) additional time to evaluate the kernel function values. 
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Algorithm 2 The Leverage Score Sampling Algorithm. 

1: Input: an n X c matrix C, an integer s. 

2: Compute the condensed SVD of C (by discarding the zero singular values) to obtain the 
orthonormal bases Uc G where p = rank(C) < c; 

3: Compute the sampling probabilities Pi = siijp, where li = HefUclH is the i-th leverage score; 
4: Initialize S to be an matrices of size n x 0; 

5: for i = 1 to n do 

6: With probability pi, add to be a new column of S, where is the Ath standard basis; 

7: end for 

8: Output: S, whose expected number of columns is s. 


4.5 Implementation Details 

In practice, the approximation accuracy and numerical stability can be significantly 
improved by the following techniques and tricks. 

If P and S are both random sampling matrices, then empirically speaking, enforcing 
V <Z S significantly improves the approximation accuracy. Here V and S are the subsets 
of [n] selected by P and S, respectively. Instead of directly sampling s indices from [n] by 
Algorithm 2, it is better to sample s indices from [n] \V to form S' and let 5 = S' UV. 
In this way, s + c columns are sampled. Whether the requirement V C S improves the 
accuracy is unknown to us. 


Corollary 5 Theorem 3 still holds when we restrict V C S. 


Proof Let pi, - ■ ■ be the original sampling probabilities without the restriction V C S. 
We define the modified sampling probabilities by 


r 1 if i G P; 

I Pi otherwise . 


The column sampling with restriction V <Z S amounts to sampling columns according to 
Pi, • ■ ■ ,Pn- Since pi > pi for all i £ [n], it follows from Remark 14 that the error bound will 
not get worse if pi is replaced by p*. ■ 


If S is the leverage score sampling matrix, we find it better not to scale the entries of 
S, although the scaling is necessary for theoretical analysis. According to our observation, 
the scaling sometimes makes the approximation numerically unstable. 

4.6 Additional Properties 

When K is a low-rank matrix, the Nystrom method and the prototype model are guaranteed 
to exactly recover K (Kumar et ah, 2009; Talwalkar and Rostamizadeh, 2010; Wang et al., 
2016). We show in the following theorem that the fast model has the same property. We 
prove the theorem in Appendix E. 

Theorem 6 (Exact Recovery) Let K be any n x n symmetric matrix, P G and 

S G be any sketching matrices, C = KP, and W = P^C. Assume that rank(S^C) > 
rank(W). Then K = C(S^C)t(S^KS)(C^S)tC^ if and only if rank(K) = rank(C). 
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In the following we establish a lower error bound of the fast model, which implies that 
to attain the 1 + e Frobenius norm bound relative to the best rank k approximation, the 
fast model must satisfy 

c>VL{k/e) and s > VLiy-\/ nk / e). 

Notice that the theorem only holds for column selection matrices P and S. We prove the 
theorem in Appendix F. 

Theorem 7 (Lower Bound) Let P G and S G be any two column selection 

matrices such that V C S C [n], where V and S are the index sets formed by P and S, 
respectively. There exists an n x n symmetric matrix K such that 

||K - K^ffWp n - c / 2k\ n - s k{n - s) 

||K —Kfc|||, ~ n — k\ c) n — k ’ 

where k is arbitrary positive integer smaller than n, C = KP G and 

Kif = C(S^C)^(S^KS)(C^S)^C'^ 

is the fast model. 

Interestingly, Theorem 7 matches the lower bounds of the Nystrom method and the 
prototype model. When s = c, the right-hand side of (7) becomes 0(1 -|- kn/c^), which is 
the lower error bound of the Nystrom method given by Wang and Zhang (2013). When 
s = n, the right-hand side of (7) becomes 0(1 -|- k/c), which is the lower error bound of the 
prototype model given by Wang et al. (2016). 

5. Extension to CUR Matrix Decomposition 

In Section 5.1 we describe the CUR matrix decomposition and establish an improved error 
bound of CUR in Theorem 8. In Section 5.2 we use sketching to more efficiently compute 
the U matrix of CUR. Theorem 8 and Theorem 9 together show that our fast CUR method 
satisfies 1 -|- e error bound relative to the best rank k approximation. In Section 5.3 we 
provide empirical results to intuitively illustrate the effectiveness of our fast CUR. In 
Section 5.4 we discuss the application of our results beyond the CUR decomposition. 

5.1 The CUR Matrix Decomposition 

Given any m x n matrix A, the CUR matrix decomposition is computed by selecting c 
columns of A to form C G and r rows of A to form R G and computing the U 

matrix such that ||A — CUR|||;. is small. CUR preserves the sparsity and non-negativity 
properties of A; it is thus more attractive than SVD in certain applications (Mahoney and 
Drineas, 2009). In addition, with the CUR of A at hand, the truncated SVD of A can be 
very efficiently computed. 

A standard way to finding the U matrix is by minimizing ||A — CUR|||. to obtain the 
optimal U matrix 

U* = argmin||A-CURllI = ATO, (8) 

u 
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which has been used by Stewart (1999); Wang and Zhang (2013); Boutsidis and Woodruff 
(2014). This approach costs time 0{m(P‘ + nr^) to compute the Moore-Penrose inverse and 
C)(mn-min{c, r}) to compute the matrix product. Therefore, even if C and R are uniformly 
sampled from A, the time cost of CUR is 0{mn • min{c, r}). 

At present the strongest theoretical guarantee is by Boutsidis and Woodruff (2014). 
They use the adaptive sampling algorithm to select c = 0{kje) column and r = 0{kje) 
rows to form C and R, respectively, and form U* = C^ARC The approximation error is 
bounded by 

||A-CU*R||| < (l + e)||A-Afclll,. 

This result matches the theoretical lower bound up to a constant factor. Therefore this 
CUR algorithm is near optimal. We establish in Theorem 8 an improved error bound of 
the adaptive sampling based CUR algorithm, and the constants in the theorem are better 
than the those in (Boutsidis and Woodruff, 2014). Theorem 8 is obtained by following the 
idea of Boutsidis and Woodruff (2014) and slightly changing the proof of Wang and Zhang 
(2013). The proof is in Appendix G. 

Theorem 8 Let A be any given m x n matrix, k be any positive integer less than m and 
n, and e G (0,1) be an arbitrary error parameter. Let C G and R G be columns 

and rows of A selected by the near-optimal column selection algorithm of Boutsidis et al. 
(2014-). When c and r are both greater than Ake~^{l + o(l)), the following inequality holds: 

E||A-CCtARtRllI < (l + e)||A-Afclll, 

where the expectation is taken w.r.t. the random column and row selection. 

5.2 Fast CUR Decomposition 

Analogous to the fast SPSD matrix approximation model, the CUR decomposition can 
be sped up while preserving its accuracy. Let Sc G and Sr G be any 

sketching matrices satisfying the approximate matrix multiplication properties. We propose 
to compute U more efficiently by 

U = argmin||SgASij-(S^C)U(RSi?)||| 

u 

= (S|2^(S|^(R^, (9) 

CXSc ScXSr SrXr 

which costs time 

0(srr^ + ScC^ + ScSr ■ min{c, r}) + Tsketch, 

where Tsketch denotes the time for forming the sketches S^ASr, sgc, and RSi?. As for 
Gaussian projection, SRHT, and count sketch, Tsketch are respectively O (nnz(A) min{sc, Sr}), 
C)(mrelog(min{sc, Sr})), and C)(nnz(A)). As for leverage score sampling and uniform 
sampling, Tsketch are respectively 0{mc^ + nr^ + ScSr) and 0{scSr). Forming the sketches 
by column selection is more efficient than by random projection. 

The following theorem shows that when Sc and Sr are sufficiently large, U is nearly as 
good as the best possible U matrix. In the theorem, leverage score sampling means that 
Sc and Sr sample columns according to the row leverage scores of C and R^, respectively. 
The proof is in Appendix H. 
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Theorem 9 Let A G C G R G be any fixed matrices with c ^ n and 

r m. Let q = min{m,n} and q = min{m/c, n/r}. The sketehing matrices Sc G 
and Sr G are described in Table 5. Assume that e~^ = o{q) or e~^ = o{q), as shown 

in the table. The matrix U is defined in (9). Then the inequality 

||A-CUR||l < (l + e)min||A-CUR||l 

u 

holds with probability at least 0.7. 


Table 5: Leverage score sampling means sampling according to the row leverage scores of 
C and the column leverage scores of R, respectively. For uniform sampling, the 
parameter /u(C) is the row coherence of C and i^(R) is the column coherence of 

R. 


Sketching 

Order of Sc 

Order of Sr 

Assumption 

Leverage Score Sampling 


r\Jqfi 

= o(g) 

Uniform Sampling 

n{C)c^/qJl 


= o{q) 

Gaussian Projection 

/I(cAlogf) 

^/^(^ + logv) 

£-1 = o(g) 

SRHT 

^(c + logi^)log(m) 

^(r + log^!f)log(n) 

£-1 = o\q) 

Count Sketch 


r'Jqfi 

£“^ = o(g) 


As for leverage score sampling, uniform sampling, and count sketch, the sketch sizes 
Sc = 0[c-sjqje) and Sr = Oir-sJqje) suffice, where q = min{m, n}. As for Gaussian 
projection and SRHT, much smaller sketch sizes are required: Sc = 0{-\fmcfe) and 
Sr = 0{\/nrJe) suffice. However, these random projection methods are inefficient choices 
in this application and only have theoretical interest. Only column sampling methods have 
linear time complexities. If Sc and Sr are leverage score sampling matrices (according to 
the row leverage scores of C and R^, respectively), it follows from Theorem 9 that U with 
1 + e bound can be computed in time 

Oi^Srr ScC A ScSr * millj^C, A Tsketch — ^ (ere ^ • min{m, n} ■ min{c, r}), 
which is linear in C)(min{m, n}). 

5.3 Empirical Comparisons 

To intuitively demonstrate the effectiveness of our method, we conduct a simple experiment 
on a 1920 x 1168 natural image obtained from the internet. We first uniformly sample 
c = 100 columns to form C and r = 100 rows to form R, and then compute the U matrix 
by varying Sc and Sr- We show the image A = CUR in Figure 2. 

Figure 2(b) is obtained by computing the U matrix according to (8), which is the best 
possible result when C and R are fixed. The U matrix of Figure 2(c) is computed according 
to Drineas et al. (2008): 

U = (PSAPc)^ 
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where Pc and P/j are column selection matrices such that C = APc and R = P^A. This 
is equivalently to (9) by setting Sc = Pi? and Sr = Pc- Obviously, this setting leads to 
very poor quality. In Figures 2(c) and (d) the sketching matrices Sc and Sr are uniform 
sampling matrices. The figures show that when Sc and Sr are moderately greater than r 
and c, respectively, the approximation quality is significantly improved. Especially, when 
Sc = 4r and Sr = 4c, the approximation quality is nearly as good as using the optimal U 
matrix defined in ( 8 ). 




(c) Sc = r,Sr = c. 


(d) Sc = 2r, Sr = 2c. 


(e) Sc = 4r, Sr = 4c. 


Figure 2: (a): the original 1920 x 1168 image, (b) to (e): CUR decomposition with c = r = 
100 and different settings of Sc and Sr- 


5.4 Discussions 

We note that we are not the first to use row and column sampling to solve the CUR problem 
more efficiently, though we are the first to provide rigorous error analysis. Previous work 
has exploited similar ideas as heuristics to speed up computation and to avoid visiting every 
entry of A. For example, the MEKA method (Si et al., 2014a) partitions the kernel matrix 
K into blocks (^ = ... ^5 g-nd j = 1, • • • , 6 ), and requires solving 

L(*d) = argmin ||wWlw(^')^-K(*4)||2 

L 

for all z G [ 6 ], j G [ 6 ], and i 7 ^ j. Since and have much more rows than columns, 
Si et al. (2014a) proposed to approximately solve the linear system by uniformly sampling 
rows from and columns from (W^-^^)^ and gnO they noted that this 

heuristic works pretty well. The basic ideas of our fast CUR and their MEKA are the 
same; their experiments demonstrate the effectiveness and efficiency of this approach, and 


17 



























Wang, Zhang, and Zhang 


Table 6: A summary of the datasets for kernel approximation. 


Dataset 

Letters 

PenDigit 

Cpusmall 

Mushrooms 

WineQuality 

^Instance 

15,000 

10,992 

8,192 

8,124 

4,898 

^Attribute 

16 

16 

12 

112 

12 

a (when rj = 0.90) 

0.400 

0.101 

0.075 

1.141 

0.314 

a (when rj = 0.99) 

0.590 

0.178 

0.180 

1.960 

0.486 


our analysis answers why this approach is correct. This also implies that our algorithms 
and analysis may have broad applications and impacts beyond the CUR decomposition and 
SPSD matrix approximation. 






(i) Wine, 77 = 0.9. (j) Wine, -q — 0.99. 


-e— Fast (leverage) 
* Fast (uniform) 

' Nystrom 
- - Prototype 


(k) Legend. 


Figure 3: The plot of ^ against the approximation error ||K — CUC'^|||,/||K|||,, where C 
contains c = [n/100] column of K G selected by uniform sampling. 
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(i) Wine, 77 = 0.9. (j) Wine, rj — 0.99. 


-e— Fast (leverage) 
+ - Fast (uniform) 

' Nystrom 
- - Prototype 


(k) Legend. 


Figure 4: The plot of ^ against the approximation error ||K — CUC^|||./||K|||,, where C 
contains c = [n/100] column of K G selected by the uniform+adaptive^ 

sampling algorithm (Wang et ah, 2016). 


6 . Experiments 

In this section we conduct several sets of illustrative experiments to show the effect of the 
U matrix. We compare the three methods with different settings of c and s. We do not 
compare with other kernel approximation methods for the reasons stated in Section 3.2.2. 

6.1 Setup 

Let X = [xi,..., x„] be the dx n data matrix, and K be the RBF kernel matrix with each 

II II2 

entry computed by Kij = exp ( — lPh_Till 2 ^ where a is the scaling parameter. 

When comparing the kernel approximation error ||K — CUC^|||^, we set the scaling 
parameter a in the following way. We let k = [n/100] and define 

^ l|Kf^ Er=i'^?(K)’ 
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which indicate the importance of the top one percent singular values of K. In general r] 
grows with a. We set a such that r] = 0.9 or 0.99. 

All the methods are implemented in MATLAB and run on a laptop with Intel i5 2.5GHz 
CUP and 8GB RAM. To compare the running time, we set MATLAB in the single thread 
mode. 

6.2 Kernel Approximation Accuracy 

We conduct experiments on several datasets available at the LIBSVM site. The datasets are 
summarized in Table 6. In this set of experiments, we study the effect of the U matrices. 
We use two methods to form C G uniform sampling and the uniform+adaptive^ 

sampling (Wang et al., 2016); we fix c = [n/100]. For our fast model, we use two kinds of 
sketching matrices S G uniform sampling and leverage score sampling; we vary s from 

2c to 40c. We plot ^ against the approximation error ||K — CUC^|||./||K|||, in Figures 3 
and 4. The Nystrom method and the prototype model are included for comparison. 

Figures 3 and 4 show that the fast SPSD matrix approximation model is significantly 
better than the Nystrom method when s is slightly larger than c, e.g., s = 2c. Recall that 
the prototype model is a special case of the fast model where s = n. We can see that the 
fast model is nearly as accurate as the prototype model when s is far smaller than n, e.g., 
s = 0.2n. 

The results also show that using uniform sampling and leverage score sampling to 
generate S does not make much difference. Thus, in practice, one can simply compute 
S by uniform sampling. 

By comparing the results in Figures 3 and 4, we can see that computing C by 
uniform+adaptive^ sampling is substantially better than uniform sampling. However, 
adaptive sampling requires the full observation of K; thus with uniform+adaptive^ 
sampling, our fast model does not have much advantage over the prototype model in terms 
of time efficiency. Our main focus of this work is the U matrix, so in the rest of the 
experiments we simply use uniform sampling to compute C. 

6.3 Approximate Kernel Principal Component Analysis 

We apply the three methods to approximately compute kernel principal component analysis 
(KPCA), and contrast with the exact solution. The experiment setting follows Zhang and 
Kwok (2010). We fix k and vary c. For our fast model, we set s = 2c, 4c, or 8c. Since 
computing S by uniform sampling or leverage score sampling yields the same empirical 
performance, we use only uniform sampling. Let CUC^ be the low-rank approximation 
formed by the three methods. Let VAV^ be the /^-eigenvalue decomposition of CUC^. 

6.3.1 Quality of the Approximate Eigenvectors 

Let UK,fc £ contain the top k eigenvectors of K. In the first set of experiments, we 

measure the distance between Uk,*; and the approximate eigenvectors V by 

Misalignment = ^||Uk,A: - VV^UK,fc||^ (G [0,1]). (10) 

Small misalignment indicates high approximation quality. We fix A: = 3. 
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(b) PenDigit, rj = 0.99. (c) Cpusmall, rj = 0.9. 



Time (s) Time (s) 


(d) Cpusmall, rj = 0.99. 


(e) Mushrooms, rj — 0.9. 


(f) Mushrooms, rj = 0.99. 



- - - ■ Nystrom 

-Prototype 

■ - o - Fast (s=2c) 

- Fast (s=4c) 

' " Fast (s=8c) 


(i) Legend. 


Figure 5: The plot of (log-scale) elapsed time against the (log-scale) misalignment defined 
in (10). 


We conduct experiments on the datasets summarized in Table 6. We record the elapsed 
time of the entire procedure—computing (part of) the kernel matrix, computing C and 
U by the kernel approximation methods, computing the ^-eigenvalue decomposition of 
CUC^. We plot the elapsed time against the misalignment defined in Figure 5. Results 
on the Letters dataset are not reported because the exact fc-eigenvalue decomposition on 
MATLAB ran out of memory, making it impossible to calculate the misalignment. 

At the end of Section 3.2.1 we have mentioned the importance of memory cost of 
the kernel approximation methods and that all three compared methods cost 0{nc + nd) 
memory. Since n and d are fixed, we plot c against the misalignment in Figure 6 to show 
the memory efficiency. 

The results show that using the same amount of time or memory, the misalignment 
incurred by the Nystrom method is usually tens of times higher than our fast model. The 
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(a) PenDigit, rj — 0.9. (b) PenDigit, r/ = 0.99. (c) Cpusmall, rj = 0.9. 


E 

o) 10 



(d) Cpusmall, rj = 0.99. 


(e) Mushrooms, rj — 0.9. 


(f) Mushrooms, r] = 0.99. 



- - - ■ Nystrom 

-Prototype 

■ - o - Fast (s=2c) 

- Fast (s=4c) 

' " Fast (s=8c) 


(g) Wine, r] = 0.9. 


(h) Wine, rj = 0.99. 


(i) Legend. 


Figure 6; The plot of c against the (log-scale) misalignment defined in (10). 


Table 7: A summary of the datasets for clustering and classification. 


Dataset 

MNIST 

Pendigit 

USPS 

Mushrooms 

Gisette 

DNA 

i^itlnstance 

60, 000 

10,992 

9,298 

8,124 

7,000 

2,000 

i^itAttribute 

780 

16 

256 

112 

5,000 

180 

#Class 

10 

10 

10 

2 

2 

3 

Scaling Parameter a 

10 

0.7 

15 

3 

50 

4 


experiment also shows that with fixed c, the fast model is nearly as accuracy as the prototype 
model when s = 8c <C n. 

6.3.2 Quality of the Generalization 

In the second set of experiments, we test the generalization performance of the kernel 
approximation methods on classification tasks. The classification datasets are described in 
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(a) MNIST 


(b) PenDigit 


(c) USPS 


(d) Mushrooms 




(e) Gisette (f) DNA 


■ - - ■ Nystrom 

-Prototype 

- o - Fast (s=2c) 

- Fast (s=4c) 

♦ ■ Fast (s=8c) 


(g) Legend 


Figure 7: The plot of c against the classification error. Here k = 3. 



(a) MNIST 


(b) PenDigit 


(c) USPS (d) Mushrooms 




(e) Gisette 


(f) DNA 


■ - - ■ Nystrom 

-Prototype 

- <»■ - Fast (s=2c) 

- Fast (s=4c) 

♦ ■ Fast (s=8c) 


(g) Legend 


Figure 8; The plot of elapsed time against the classification error. Here k = 3. 


Table 7. For each dataset, we randomly sample ni = 50%n data points for training and the 
rest 50%n for test. In this set of experiments, we set k = 3 and k = 10. 

We let K G be the RBF kernel matrix of the training data and k(x) G 

II II2 

be defined by [k(x)]i = exp ( — where Xj is the Ath training data point. In the 

training step, we approximately compute the top k eigenvalues and eigenvectors, and denote 
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(a) MNIST 


(b) PenDigit 


(c) USPS 


(d) Mushrooms 




- - - ■ Nystrom 
-Prototype 

- o- - Fast (s=2c) 
- Fast (s=4c) 

* ' Fast (s=8c) 


(e) Gisette 


(f) DNA 


(g) Legend 


Figure 9: The plot of c against the classification error. Here k = 10. 






(a) MNIST 


(b) PenDigit 


(c) USPS 


(d) Mushrooms 




(e) Gisette (f) DNA 


■ - - ■ Nystrom 

-Prototype 

- o- - Fast (s=2c) 

- Fast (s=4c) 

♦ ■ Fast (s=8c) 

(g) Legend 


Figure 10: The plot of elapsed time against the classification error. Here k = 10. 


A G and V G The feature vector (extracted by KPCA) of the z-th training 

data point is the i-th column of A ' V^. In the test step, the feature vector of test data 
X is A ’ V^k(x). Then we put the training labels and training and test features into the 
MATLAB K-nearest-neighbor classifier knnclassify to classify the test data. We fix the 
number of nearest neighbors to be 10. The scaling parameters of each dataset are listed in 
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Table 7. Since the kernel approximation methods are randomized, we repeat the training 
and test procedure 20 times and record the average elapsed time and average classification 
error. 

We plot c against the classification error in Figures 7 and 9, and plot the elapsed time 
(excluding the time cost of KNN) against the classification error in Figures 8 and 10. Using 
the same amount of memory, the fast model is signihcantly better than the Nystrom method, 
especially when c is small. Using the same amount of time, the fast model outperforms the 
Nystrom method by one to two percent of classihcation error in many cases, and it is at 
least as good as the Nystrom method in the rest cases. This set of experiments also indicate 
that the fast model with s = 4c or 8c has the best empirical performance. 





- - - ■ Nystrom 
-Prototype 

- o- - Fast (s=2c) 
- Fast (s=4c) 

■ +■ Fast (s=8c) 


(e) Gisette 


(f) DNA 


(g) Legend 


Figure 11: The plot of c against NMI. 


6.4 Approximate Spectral Clustering 

Following the work of Fowlkes et al. (2004), we evaluate the performance of the kernel 
approximation methods on the spectral clustering task. We conduct experiments on the 
datasets summarized in Table 7. 

We describe the approximate spectral clustering in the following. The target is to 
cluster n data points into k classes. We use the RBF kernel matrix K as the weigh matrix 
and let CUC^ ~ K be the low-rank approximation. The degree matrix D = diag(d) 
is a diagonal matrix with d = CUC^ln, and the normalized graph Laplacian is L = 
In - D-V2(CUC^)D^ The bottom k eigenvectors of L are the top k eigenvectors of 

^ V ^ V 

nxc cxn 

which can be efficiently computed according to Appendix A. We denote the top k 
eigenvectors by V £ We normalize the rows of V and take the normalized rows 
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0.56 

0.54 

0.52 

1 

Z 0.5 
0.48 
0.46 
0.44' 


Time (s) 

(d) Mushrooms 




(f) DNA 


- - - ■ Nystrom 
-Prototype 

- o- - Fast (s=2c) 
- Fast (s=4c) 

Fast (s=8c) 


(g) Legend 


Figure 12: The plot of elapsed time against NMI. 


of V as the input of the /c-means clustering. Since the matrix approximation methods are 
randomized, we repeat this procedure 20 times and record the average elapsed time and the 
average normalized mutual information (NMI)^ of clustering. 

We plot c against NMI in Figure 11 and the elapsed time (excluding the time cost 
of /c-means) against NMI in Figure 12. Figure 11 shows that using the same amount of 
memory, the performance of the fast model is better than the Nystrom method. Using 
the same amount of time, the fast model and the Nystrom method have almost the same 
performance, and they are both better than the prototype model. 

7. Concluding Remarks 

In this paper we have studied the fast SPSD matrix approximation model for approximating 
large-scale SPSD matrix. We have shown that our fast model potentially costs time linear 
in n, while it is nearly as accurate as the best possible approximation. The fast model is 
theoretically better than the Nystrom method and the prototype model because the latter 
two methods cost time quadratic in n to attain the same theoretical guarantee. Experiments 
show that our fast model is nearly as accurate as the prototype model and nearly as efficient 
as the Nystrom method. 

The technique of the fast model can be straightforwardly applied to speed up the CUR 
matrix decomposition, and theoretical analysis shows that the accuracy is almost unaffected. 
In this way, for any m x n large-scale matrix, the time cost of computing the U matrix 
drops from 0{mn) to 0{mm{m,n}). 


3. NMI is a standard metric of clustering. NMI is between 0 and 1. Big NMI indicates good clustering 
performance. 
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Appendix A. Approximately Solving the Eigenvalne Decomposition and 
Matrix Inversion 

In this section we show how to use the SPSD matrix approximation methods to speed up 
eigenvalue decomposition and linear system. The two lemmas are well known results. We 
show them here for the sake of self-containing. 

Lemma 10 (Approximate Eigenvalue Decomposition) Given C G and U G 

]^cxc_ ^/jg eigenvalue decomposition o/K = CUC'^ can he computed in time 0{nc^). 

Proof It cost 0{nc^) time to compute the SVD 



nxc cxc cxc 


and O(c^) time to compute Z = (EcV^)U(ScVq)^ G It costs 0{c^) time to 

compute the eigenvalue decomposition Z = VzAzV^. Combining the results above, we 
obtain 

CUC^ = (Uc5]cVg)U(UcScVS)'^ 

= UcZUg = (UcVz)Az(UcVz)'^. 

It then cost time O(nc^) to compute the matrix product UcVz- Since (UcVz) has 
orthonormal columns and Az is diagonal matrix, the eigenvalue decomposition of CUC^ 
is solved. The total time cost is C>(nc^) -|- O(c^) = 0{nc^). ■ 

Lemma 11 (Approximately Solving Matrix Inversion) Given C G SPDS 

matrix U G vector y G and arbitrary positive real number a. Then it costs 

time 0{nc^) to solve the n x n linear system (CUC'^ -|- aln)w = y to obtain w G 

In addition, if the SVD of C is given, then it takes only 0{c^ nc) time to solve the 
linear system. 

Proof Since the matrix (CUC^ -|- aln) is nonsingular when a > 0 and U is SPSD, the 
solution is w* = (CUC^ -|- Q;I„)“^y. Instead of directly computing the matrix inversion, 
we can expand the matrix inversion by the Sherman-Morrison-Woodbury matrix identity 
and obtain 

(CUC^ + aln)~^ = a“^In - + C^Cy^C^. 
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Thus the solution to the linear system is 

w* = a" V - (aU-i + C^C)"^ C^y. 

nxc cxn 


Suppose we are given only C and U. The matrix multiplication C^C costs time O(nc^), 
the matrix inversions cost time O(c^), and multiplying matrix with vector costs time 0{nc). 
Thus the total time cost is 0{nc^) + 0{c^) + 0{nc) = 0{nc^). 

Suppose we are given U and the SVD C = UcScV^. The matrix product 

C^C = Vc^lcUgUcScVc = VcS^Vc 

can be computed in time O(c^). Thus the total time cost is merely 0{c^ + nc). ■ 


Appendix B. Proof of Theorem 1 

The prototype model trivially satisfies requirement R1 with e = 0. However, it violates 
requirement R2 because computing the U matrix by solving minu ||K ~ CUC^||^ costs 
time O(n^c). 

For the Nystrom method, we provide such an adversarial case that assumptions A1 and 
A2 can both be satisfied and that requirements R1 and R2 cannot hold simultaneously. The 
adversarial case is the block diagonal matrix 

K = diag(B,... ,B), 

k blocks 


where 


B = (1 — a)Ip + alplp, a < 1, and P = w, 

and let a —)• 1. Wang et al. (2016) showed that sampling c = 3 /c 7 “^(l + o(l)) columns 
of K to form C makes assumptions Al and A2 in Question 1 be satisfied. This indicates 
that C is a good sketch of K. The problem is caused by the way the matrix is 

computed. Wang and Zhang (2013, Theorem 12) showed that to make requirement R1 in 
Question 1 satisfied, c must be greater than H(y^ nk /(e + 7 )). Thus it takes time O(nc^) = 
(e + 7 )) to compute the rank-fc eigenvalue decomposition of or the linear 

system (CU^^^C^ + al„)w = y. Thus, requirement R2 is violated. 


Appendix C. Proof of Lemma 2 

Lemma 2 is a simplified version of Lemma 12. We prove Lemma 12 in the subsequent 
subsections. In the lemma, leverage score sampling means that the sampling probabilities 
are proportional to the row leverage scores of U G For uniform sampling, /r(U) is 

the row coherence of U. 
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Lemma 12 Let U G be any fixed matrix with orthonormal columns and B G 

be any fixed matrix. Let S G be any sketching matrix described in Table 8. Then 



p{||U^SS^U-Ifc ||2 > (Property 1), 

P|||U^B-U^SS'^BjlJ, > €||B|||} < ^2 (Property 2), 

J . 

U'^B-U^SS^Bjl^ > e'||B||^ +-||B|||| < (Property 3). 


Table 8: The sketch size s for satisfying the three properties. For SRHT, we define A = 
(l + v^8A:-Mog(100n))^ and A' = (^1 + log . 


Sketching 

Leverage Sampling 
Uniform Sampling 
SRHT 

Gaussian Projection 
Count Sketch 


Property 1 


kW I 

M(U)fc^logA 

9 {\/k+y/2 log(2/5i)) 
k^ + k 

SirL 


Property 2 
IT 

eS2 

M(U)fc 

£^2 

6(52-0.01) 

18fc 

6^2 

2k 

6^2 


Property 3 


^(l + ^fc-Mog^)' 


C.l Column Selection 

In this subsection we prove Property 1 and Property 2 of leverage score sampling and 
uniform sampling. We cite the following lemma from (Wang et al., 2016); the lemma was 
firstly proved by the work Drineas et al. (2008); Gittens (2011); Woodruff (2014). 

Lemma 13 Let U G be any fixed matrix with orthonormal columns. The column se¬ 

lection matrixS G samples s columns according to arbitrary probabilities pi,p 2 , ■ ■ ■ ,Pn- 
Assume a > k and 

||uj. II 2 

max- < a. 

ie[n] Pi 

If s > log(A;/5i), it holds that 

p|||lfc-U^SS^U||2 > v} < <5i. 

If s > it holds that 

p{||UB-U^SS^B||p > e||B||^} < da- 

I|U' IP 

Leverage score sampling satisfies maxjgj^j ^ < k. Uniform sampling satisfies 
l|u- IP 

maxjg[„] p, ^ < p,([J)k, where ^(U) is the row coherence of U. Then Property 1 and 
Property 2 of the two column sampling methods follow from Lemma 13. 


29 























Wang, Zhang, and Zhang 


Remark 14 Let pi, ■ ■ ■ ,pn be the sampling probabilities corresponding to the leverage score 
sampling or uniform sampling, and let pi G \pi, 1] for all i G [n] be arbitrary. For all i G [n], 
if the i-th column is sampled with probability spi and scaled by if it gets sampled, then 
Lemma 2 still holds. This can be easily seen from the proof of the above lemma (in (Wang 
et al, 2016)). Intuitively, it indicates that if we increase the sampling probabilities, the 
resulting error bound will not get worse. 


C.2 Count Sketch 

Count sketch stems from the data stream literature (Charikar et ah, 2004; Thorup and 
Zhang, 2012). Theoretical guarantees were first shown by Weinberger et al. (2009); Pham 
and Pagh (2013); Clarkson and Woodruff (2013). Meng and Mahoney (2013); Nelson and 
Nguyen (2013) strengthened and simplified the proofs. Because the proof is involved, we 
will not show the proof here. The readers can refer to (Meng and Mahoney, 2013; Nelson 
and Nguyen, 2013; Woodruff, 2014) for the proof. 


C.3 Property 1 and Property 2 of SRHT 

The properties of SRHT were established in the previous work (Drineas et al., 2011; Lu et al., 
2013; Tropp, 2011). Following (Tropp, 2011), we show a simple proof of the properties of 
SRHT. Our analysis is based on the following two key observations. 

• The scaled Walsh-Hadamard matrix and the diagonal matrix D are both 

orthogonal, so is also orthogonal. If U has orthonormal columns, the matrix 

;^(DH„)^U has orthonormal columns. 

• For any fixed matrix U G [k <C n) with orthonormal columns, the matrix 

^(DH„)^U G has low row coherence with high probability. Tropp (2011) 

showed that the row coherence of ^(DH„)^U satisfies 


A n 

p = — max 
k 




81og(n/(I) 

k 


2 


with probability at least 1 — 5. In other words, the randomized Hadamard transform 
flats out the leverage scores. Consequently uniform sampling can be safely applied to 
form a sketch. 


In the following, we use the properties of uniform sampling and the bound on the 
coherence p to analyze SRHT. Let V = ^(DH„)^U G B = ^(DH„)^B G 

and p be the row coherence of V. It holds that 

V'^V = U'^U = Ifc, V'^PP^V = U'^SS^U, 

V'^B = U'^B, V'^PP'^B = U^SS'^B, ||B||ir = ||B||f, 

P|/i > (l + V8fc-Mog(100n))^} < 0.01. 
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Therefore it suffices to prove that 

P|||lfc - V^PP^V||2 > v} < 5i-0.01, 
p|||VB-V^PP'^BjlJ, > e||B|||| < 52-0.01. 

The above inequalities follows from the two properties of uniform sampling. 

C.4 Property 1 and Property 2 of Gaussian Projection 

The two properties of Gaussian projection can be found in (Woodruff, 2014). In the following 
we prove Property 1 in a much simpler way than (Woodruff, 2014). 

The concentration of the singular values of standard Gaussian matrix is very well known. 
Let G be an n X s (n > s) standard Gaussian matrix. For any fixed matrix U G 
with orthonormal columns, the matrix N = G^U G is also standard Gaussian matrix. 
Vershynin (2010) showed that for every t > 0, the following holds with probability at least 

y/s — Vk — t < CJfc(N) < (Ti(N) < y/s + Vk + t. 

Therefore, for any rj G (0,1), if s = 9ri~‘^(y/k + Y^2log(27^)^, then 

(Tj(U^SS^U) = iTf(S^U) G 1 ± 7/ for alH G [n] 
hold simultaneously with probability at least 1 — 5i. Hence 

p{||lfc-U^SS^U||2 > r/} < 5i. 

This conclndes Property 1 of Gaussian projection. 

C.5 Property 3 of SRHT and Gaussian Projection 

The following lemma is the main result of (Cohen et al., 2015). If a sketching method 
satisfies Property 1 for arbitrary column orthogonal matrix U, then it satisfies Property 3 
due to the following lemma. Notice that the lemma does not apply to the leverage score 
and uniform sampling because they depends on the leverage scores or matrix coherence 
of specific column orthogonal matrix U. The lemma is inappropriate for count sketch 
because Property 1 of count sketch holds with constant probability rather than arbitrary 
high probability. 

Lemma 15 Let A G and B G i^nxd jij-Qfji rnatrices and r he any fixed integer. 

Let k > k and d > d he the least integer divisible hy r. Let S G be a certain data- 

independent sketching matrix satisfying 

p|||U^SS^U-l||2 > r?| < is 

I" "2 - - kd 

for any fixed matrix U G orthonormal columns. Then 

A^SS^B-A^B||^ < r?(||A||i + ~ (||B||i + ~ 

holds with probability at least 1 — ^ 3 . 
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SRHT and Gaussian projection enjoys Property 1 with high probability for arbitrary 
column orthogonal matrix U. Thus Property 3 can be immediately obtained by applying 
the above lemma with the setting r = k. 


Appendix D. Proof of Theorem 3 

Let K G be any fixed SPSD matrix, C G be any fixed matrix, S G be a 

sketching matrix, and 

U* = argminllK-CUC^IIp = C^K{C^)\ 

u 

U = argmin||S^(K-CUC^)S||p = (S^C)'f(S^KS)(C^S)^ 

u 

Lemma 16 is a direct consequence of Lemma 24. 


Lemma 16 Let K G be any fixed SPSD matrix, C G be any fixed matrix, and 

C = UcScVg be the SVD. Assume that S^Uc has full column rank. Let U* and U be 
defined in the above. Then the following inequality holds: 

||K-CUC^||| < \\A-CV*C^\\l+(2fVh + f,/gfgfiy, 
where a G [0,1] is arbitrary and 

f = ^mL(uSSS^Uc), h = ||uSSS^(K - UcUSk)|| J, 

52 = ||uSSS^(R - UcUS)K“g, gp = ||uSSS'^(I„ - UcUgjK^-^Hj,. 

The following lemma shows that X is nearly as good as X* in terms of objective function 
value if S satisfies Assumption 1. 


Assumption 1 Let B be any fixed matrix. Let C G and C = UcXcV^ be the SVD. 

Assume that the sketching matrix S G satisfies 

p{||UcSS^Uc-l||2 > ^} < <5i 

p{||ugSS^B-UgB||^ > e||B|||.| < 82 

for any 81,82 £ (0,1/3). 

Lemma 17 Let K G be any fixed SPSD matrix, C G be any fixed matrix, 

and C = UcXcVg be the SVD. Let U* and U be defined in the above, respectively. Let 
S G be certain sketching matrix satisfying Assumption 1. Assume that = o(n). 

Then 


IK - CUC^II^ - llA - CU*C 


< 


/ 20 Vi I 
V 9 ' 


A - CU*C^| 


ri|2 
F 

lOOe,, 


^^81 


(K - UcUg)K|| 


< 4e^n A-CU*C 


T||2 

\f- 


holds with probability at least 1 — 81 — 282 . 
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Proof Let /, h, g 2 , gp, ct be defined in Lemma 16 and fix a = 1/2. Under Assumption 1 
it holds simultaneously with probability at least 1 — — 252 that 

/<y, h<e\\{ln-VcVl)K\\l, 52 <<7F <e||(I„-UcUg)Ki/2||^. 

It follows that 


52<5f < e-tr((I„-UcUg)KV2Ki/2(i„_UcUS)) 

< e-tr((I„-UcUg)K(I„-UcUg)) 

= e||(I„-UcUg)K(I„-UcUS)||^ 

< e||(I„-UcUg)K||^. 


It follows from Lemma 16 and the assumption e ^ = o(n) that 


K - CUC^ 


- A-CU*C' 


< 

< 


(^\\A - CU*C^||^ + ^||(In - UcUg)K|' 


V 9 

f 20 y^||A_cu*C^| 

V 9 " ' 

lO'^e^n 


+ ^°^l|(In-UcUg)K|| '' 


94 


(l + o(l))||A-CU*C 


ri |2 


by which the lemma follows. 


Under both Assumption 1 and Assumption 2, the error bound can be further improved. 
We show the improved bound in Lemma 18. 

Assumption 2 Let B be any fixed matrix. Let C G kc = rank(C), and C = 

UcScVg be the SVD. Assume that the sketching matrix S G satisfies 

p|||UgSS^B-UgB ||2 > e||B||^ + ^||B|||| < dg 

for any 63 G (0,1/3). 

Lemma 18 Let K G be any fixed SPSD matrix, C G be any fixed matrix, 

kc = rank(C), and C = Uc^IcVg be the SVD. Let U* and U be defined in the beginning 
of this section. Let S G be certain sketching matrix satisfying both Assumption 1 and 

Assumption 2. Assume that e = oinjkfi). Then 

||K-CUC^||5, < ||A-CU*C^||p + 4eV^c||A-CU*C^||5, 
holds with probability at least 1 — di — <52 — <^ 3 - 
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Proof Let /, h, § 2 , Qf, oi be defined in Lemma 16 and fix a = 0. Under Assumption 1 it 
holds simultaneously with probability at least 1 — <5i — ^2 that 

/<y, h = gF<e\\{ln-VcUl)K\\l. 

Under Assumption 2, it holds with probability at least 1 — ^3 that 

52 = ||uSSS^(In-UcUS)+Ug(I^-UcUS)||2 

=0 

< e\\ln-VcVc\\l + ^\\ln-UcVl\\l < e + ^{n-k,) = 

Kc f^c 

It follows from Lemma 16 and the assumption = o{njkc) that 

||K - CUC'^II^ - ||A - CU*C'^||^ 

< (^||A - CU*C^||^ + ||A - CU*C^||^)' 

< 4eVfcc||A-CU*C'^||^, 

by which the lemma follows. ■ 


Finally, we prove Theorem 3 using Lemma 17 and Lemma 18. Leverage score sampling, 
uniform sampling, and count sketch satisfy Assumption 1, and the bounds follow by setting 
e = 0.5y?7 n and applying Lemma 17. For the three sketching methods, we set di = 0.01 
and 62 = 0.095. 

Gaussian projection and SRHT satisfy Assumption 1 and Assumption 2, and their 
bounds follow by setting e = O.Sy^e'fcc/^ and applying Lemma 18. For Gaussian projection, 
we set (5i = 0.01, 62 = 0.09, and ds = 0.1. For SRHT, we set di = 0.02, 62 = 0.08, and 

<53 = 0.1. 

Appendix E. Proof of Theorem 6 

Since C = KP G W = P^C G and rank(S^C) > rank(W), we have that 

rank(K) > rank(C) > rank(S^C) > rank(W). (11) 

If rank(C) = rank(K), there exists a matrix X such that K = CX. By left multiplying 
both sides by P^, it follows that 

= P^K = P^CX = WX, 

and thus rank(W) = rank(S^C) = rank(C) = rank(K). It follows from K = CX and 
C = X'^W that 

K = X^WX. 
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We let $ = XS, and it holds that 

= C(S^C)'f(S^KS)(C^S)'fC^ 

= X^W(S^X'^W)t(S^X'^WXS)(WXS)'^WX 
= X^W($^W)t($^W$)(W$)'^WX. 

Let rank(W) = rank(C) = rank(S'^C) = rank(K) = p. Since W is symmetric, we denote 
the rank-/9 eigenvalue decomposition of W by 

W = UwAwU^. 

exp pxp pxc 

Since S'^C = and rank(S^C) = rank(W) = p, we have that rank($^W) = 

rank(W) = p. The n x p matrix must have full column rank, otherwise 

rank($^W) < p. Thus we have 

($^W)t = ($^UwAwU^)t = (AwU^)t($^Uw)^. 

It follows that 

Kf"f = X^W(AwU^)t($'^Uw)U^^Uw)Aw(U^^)(U^^)^(UwAw)^WX 

’ V 

^ ‘V* “V" ^ V* 

exp pxn nxp 

= X^UwAwUwX = X'^WX = K. 

This shows that the fast model is exact. To this end, we have shown that if rank(C) = 
rank(K), then the fast model is exact. 

Conversely, if the fast model is exact, that is, K = C(S^C)f(S^KS)(C^S)^C^, we have 
that rank(K) < rank(C). It follows from (II) that rank(K) = rank(C). 


Appendix F. Proof of Theorem 7 

We prove Theorem 7 by constructing an adversarial case. Theorem 7 is a direct consequence 
of the following theorem. 


Theorem 19 Let A he the n x n symmetric matrix defined in Lemma 21 with a —>• 1 and 
k be any positive integer smaller than n. Let V he any subset of [n] with cardinality c and 
C G contain c columns of A indexed by V. Let S he any nx s column selection matrix 
satisfying V C S, where S C [n] is the index set formed by S. Then the following inequality 
holds: 


A - C(S^C)t(S^AS)(C^S)tC'^||| 
l|A - Afc|||, 


> 


n — c 
n — k 




n — s k{n — s) 
n — k 


Proof Let A and B be defined in Lemma 21. We prove the theorem using Lemma 21 
and Lemma 23. Let n = pk. Let C consist of c column sampled from A and Cj consist 
of Ci columns sampled from the i-th diagonal block of A. Thus C = diag(Ci,-- - ,Cfc)- 
Without loss of generality, we assume Cj consists of the first Ci columns of B. Let S = 
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diag(Si, • • • , Sfc) be an n X s column selection matrix, where Sj is a p x column selection 

matrix and si + • • • = s. Then the U matrix is computed by 

U = (S^C)'^(S^AS)(C^S)^ 

= [diag(SfCi,--- ,SrCfc)]^diag(SfBSi,-- - , S^BSfc) [diag(Cf Si, • • • ,C^Sfc)]^ 

= diag((sTCl)^(SfBSl)(C^Sl)^••• ,(S^Cfc)^(S^BSfc)(C^Sfc)^). 

The approximation formed by the fast model is the block-diagonal matrix whose the i-th 
{i G [A;]) diagonal block is the p x p matrix 



Q(SfQ)^(SfBSO(crS,)^C 


It follows from Lemma 23 that for any i £ [k], 


IIB - 

lim ^ 

o->-i (1 — a)^ 




Thus 


A _ A f^st 112 

II F 


lim 

a^l (1 — a)^ 


a-5-l (1 — a)^ 


||2 


2=1 


E(P-<=.)(!+ 

2=1 * 


^P-Ci 

2=1 




2 = 1 


i=i ** 


2pE^+^ 

2=1 ^ 


, 2nk k-n? 2nk 

> n-c-2k^ -^-^-h k 

c s 

2k\ k{n — s)^ 


/ n/ 2,k\ 

= (n - c) (^1 -h —j + 


Here the inequality follows by minimizing over ci, • • • , and si, • • • ,Sk with constraints 
Fji Ci = c and Yli Si = s. Finally, it follows from Lemma 21 that 


lim 

a^l 


A _ A fast 112 

II F 

l|A-Afe|||, 


> 


n — c 
n — k 




+ 


n — s k{n — s) 
n — k 


F.l Key Lemmas 

Lemma 20 provides a useful tool for expanding the Moore-Penrose inverse of partitioned 
matrices. 
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Lemma 20 (Page 179 of Ben-Israel and Greville (2003)) Given a matrixX. G 
of rank c which has a nonsingular c x c submatrix Xu. By rearrangement of columns and 
rows by permutation matrices P and Q, the submatrix Xu can be bought to the top left 
corner of X, that is, 


PXQ 


Xii Xi2 
X21 X22 


Then the Moore-Penrose inverse of IX. is 


X^ 



(I, + TT^) ^Xr/(I, + H^H) H^]P, 


where T = X]^j^^Xi 2 and H = X 2 iX^j^’^. 


Lemmas 21 and 23 will be used to prove Theorem 19. 


Lemma 21 (Lemma 19 of Wang and Zhang (2013)) Given n and k, we let B be an 

^ X ^ matrix whose diagonal entries equal to one and off-diagonal entries equal to a € [0,1). 
We let A be an n X n block-diagonal matrix 

A = diag(B,--- ,B). (12) 

'-V-' 

k blocks 

Let Afc be the best rank-k approximation to the matrix A, then we have that 

||A-Afc|||. = (1 - a)^(n - A:). 

Lemma 22 The following equality holds for any nonzero real number a: 

(aP + 5ia^)-' = 

a{a + be) 

Proof The lemma directly follows from the Sherman-Morrison-Woodbury matrix identity 
(X -h YZR)“^ = X~^ - X^1 y(Z~^ -h RX~1 y)^^RX“^ 


Lemma 23 Let B be any nxn matrix with diagonal entries equal to one and off-diagonal 
entries equal to a. Let C = BP G let B = C(S^C)t(S^KS)(C^S)tC^ be the fast 

SPSD matrix approximation model of B. Let V and S be the index sets formed by P and 
S, respectively. IfVcS, the error incurred by the fast model satisfies 


lim 


B-BIII, 

(1 — a )2 


> (n - c)(^l -h + 


(n — s)^ 
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Proof Let Bi = S^BS G and Ci = S^C = S^BP G Without loss of 

generality, we assume that P selects the first c columns and S selects the first s columns. 
We partition B and C by: 


B = 


Bi Bl^ 

B 3 B 2 


and 


C = 


■ Cl ■ 
C2 


w 

— 

C12 



^2 


We further partition Bi G by 


Bi = 


w cs 

C12 Bi2 


where 


C 12 = als-cl^ and B 12 = (1 - a)Is_c + als_clf_c- 

The U matrix is computed by 

U = (S'^C)'f(S^BS)(C^S)'f = c|Bi(C|)^. 

It is not hard to see that Ci contains the first c rows of Bi. 

We expand the Moore-Penrose inverse of Ci by Lemma 20 and obtain 




where 


and 


W-' = (l-a)I, + alel 


-1 


H = Ci2W"i = 


1 


a 


1 — a (1 — a)(l — a + ca) 




a 


1 — a + CO 


h-cll 


It is easily verified that H^H = ( {s — c)lcl^. It follows from Lemma 22 that 

T (« - 


(I, + H 




c(s — c)o? + (1 — a + ca) 


-1 1 

2 


Then we obtain 


d = W-i(P + H^H)-'[ P H^] 


(13) 


where 


7i = C7273 - 72 


73 


1 -a’ 


a 


72 = 

73 = 


(1 — a)(l — a + ca) ’ 

{s — c)a‘^ 

c{s — c)cP‘ + (1 — a + ca)^ 
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Then 


[Ie,H'^]Bi[I„H'^]'^ = W + BfgH + H^Bi 3 + H^BiaH 

= (1 - a)Ic+ 74lcl^, 


(14) 


where 


74 = 


t;( 3 as — ac — 2 a + a^c — 3 a^s + + 1 ) 


It follows from (13) (14) that 


U = c 1 Bi(c 1 )^ = 


1 


{ac — a + 1 )^ 


Ic + 7ilcl^) ((1 - «)Ic + 74lclc 


1 — a 

1 r 

Ic + 75lclc, 


1 — a 


Ic + JllcK 


1 — a 


where 


Then we have 


75 = 


7 i + (c7i + (c 7 i 74 + 7 i(l - a) + ' 


WU = B + 76lcl^ 

76 = (1 - a + ac )75 + 


a 


1 — a 


We partition the fast SPSD matrix approximation model by 

B = 

where 


B 21 B 22 


Bn 

B 21 

B 22 


WUW = (1 — a)Ic + (a + (1 — a + ca)76) Icl^, 
WU(alclLc) = a(l+ c76)1c1Lc, 


= (aln-cir)U(alclLc) = 


a^c 


1 — a 


+ 75 C Ucln-c 


The approximate error is 
|2 


B-B||^ = ||W-W||^ + 2||B2i-B2i||^ + ||B22-B 


22 


F’ 


where 


W-w 


= \\{l-a + ca)"fQlcXc\\‘p = C‘^{ 1 - a + caf-"fl 
|B 2 i-B 2 i||^ = ||ac76lclLc||^ = a^c^(’^-c)76. 


IB 22 — B 221 


{n - c){n - c-l)a'^( —h 00^75 - l) +(n-c)f + 0^0^75 - l) . 

VI — a / VI — a / 


off-diagonal 


diagonal 
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We let 

A l|B-B||| 

(1 — a)^ 

which is a symbolic expression of a, n, s, and c. We then simplify the expression using 
MATLAB and substitute the a in 7 / by 1, and we obtain 

lim rf = {n — c)(l + 2/c) + (n — s)^/s^, 

a—>-1 

by which the lemma follows. ■ 


Appendix G. Proof of Theorem 8 

We define the projection operation Pc,fc(A) = CX where X is defined by 

X = argmin ||A — CX||^. 

rank(X)</c 

By sampling c = 2A:e“^(l + o(l)) columns of A by the near-optimal algorithm of 
Boutsidis et al. (2014) to form C G we have that 

E||A-lPc,fc(A)||^ < (l + e)||A-Afcll^. 

Applying Lemma 3.11 of Boutsidis and Woodruff (2014), there exists a much smaller column 
orthogonal matrix Z G such that range(Z) C range(C) and 

EllA-CCfAll^ < E||A-ZZ^AllJ, < ||A - Pc,fc(A)||5,. 

Notice that the algorithm does not compute Z. 

Let G be columns of A^ selected by the randomized dual-set sparsification 

algorithm of Boutsidis et al. (2014). When ri = 0{k), it holds that 

E||A-RiRfA||5, < 2(l + o(l))||A-Afclll,. 

Let R^ G be columns of PJ- selected by adaptive sampling according to the residual 

A"^ — RJ’(R^)1A^. Set r 2 = 2A:e“^(l + o(l)). Let R^ = [R^jR^']. By the adaptive 
sampling theorem of Wang and Zhang (2013), we obtain 

E||A-ZZ^ARtR||^ < E||A-ZZ'^A||5,-F —E||A-ArIr^IIJ, 

< (l + e)||K-Kfc||^ + e||K-Kfc||5, 

< (l + 2e)||K-Kfc||^. (15) 

Obviously R^ contains 

r = ri -|- r 2 = 2A:e“^ (l + o(l)) 
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columns of A^. 

It remains to show IIA - CCtARtR||2, < IIA - ZZ^ARtRlI Since the columns of Z 
are contained in the column space of C, for any matrix Y the inequality ||(Im —CCf)Y||^ < 
(\yn — ZZ^)Y||p holds. Then we obtain 

||A-CC'fAR'fRill, = ||A-ARtR + ARtR-CC^ARtRllI 

= ||A(I„ - R^R)||| + ||(I„^ - CCt)ARtR||^ 

< ||A(I„ - R^R)||^ + 11(1™ - ZZ^)ARtR||| 

= ||A(R-R'fR) + (I^-ZZ^)ARtR||| 

= ||A-ZZ^ARtR||2,. (16) 

The theorem follows from (15) and (16) and by setting e' = 2e. 

Appendix H. Proof of Theorem 9 

In Section H.l we establish a key lemma to decompose the error incurred by the 
approximation. In Section H.2 we prove Theorem 9 using the key lemma. 

H.l Key Lemma 

We establish the following lemma for decomposing the error of the approximate solution. 


Lemma 24 Let A G C G and R G be any fixed matrices, and A = 

UaSaVJ, C = UcScVg, R = UrSrV^ be the SVD. Assume that SgUc and S^Vr 
have full column rank. Let U* and U be defined in (8) and (9), respectively. Then the 
following inequalities hold: 

||A-CUR||| < ||A-CU*R||2,+ (/RV^ + /cv^ + /c/Ry^)", 
||A-CUR||^ < ||A-CU*R|||.+(/RV^ + /cv^ + /c/iiyffcffk)", 


where a G [0,1] is arbitrary, and 

fc = cT-ijUgScSgUc), 
he = ||uSScSg(A - UcUgA)!! 
gc = \\VlScSl{lm - UcUg)UA51l||^, 
g'c = ||uSScS^(I™ - UcUg)UA51l||2, 


fR = a-l(VlSRSlVn), 

hR = ||(A - AVrV^)ScS^Vr|| 5 „ 

gR = ||S^-“VA(In - VrV^)SrS^Vr|| J 

9 'r = ||si-“VA(In - VrV^)SrS^Vr||^. 


Proof Let kc = rank(C) < c and kr = rank(R) < r. Let Uc C ijg jgfi^ singular 

vectors of C and Vr G be the right singular vectors of R. Define Z*, Z G by 

Z* = U^AVr, Z = (S5Uc)^(S^ASR)(V^SR)t. 
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We have that CU*R = CCtARtR = UcUgAVRVg^ = UcZ*V^. By definition, it holds 
that that 

tJ = (SgC)t(S^ASH)(RSR)t 

= (SgUcScVS)t(S^ASH)(UR5]RV^SR)t 
= (5]cVS)t(S^Uc)t(S^ASR)(V^SR)t(UR5]R)t 
= (I]cVg)tz(UR5]R)t, 

where the third equality follows from that S^Uc and S^Vr have full column rank and 
that and have full row rank. It follows that 

CUR = Uc51cVg(5]cVg)tz(URl]R)tURSRV^ = UcZV^. 

Since CU*R = UcZ*V^ and CUR = UcZV^, it suffices to prove the two inequalities: 

||A-UcZV^||| < ||A-UcZ*V^|||+(/RV^ + /cv^ + /c/i?y^)', 
||A-UcZV^|||. < ||A-UcZ*V^|||+(/RV^ + /cV^ + /c/i?y5^)'.(17) 
The left-hand side can be expressed as 

||A-UcZV^||J, = ||(A-UcZ*V^) + Uc(Z*-Z)V^||^ 

= II(I^ - UcUg)A + UcUgA(I„ - VrV^) + Uc(Z* - Z)V^||5, 

= II (I™ - UcUg)A||^ + ||UcUSA(I„ - VrV^) + Uc(Z* - Z)V^|| J 
= ||(I„ - UcUg)A||^ + ||UcUSA(U - VrV^)|| 5, + ||Uc(Z* - Z)V^||^ 

= II (I™ - UcUg)A + UcUgA(I„ - VrV^)|| 5, + ||Uc(Z* - Z)V^||5, 

= IIA - UcUSaVrV^II^ + ||Uc(Z* - Z)V^||^. 

From (17) we can see that it suffices to prove the two inequalities: 

- /rV^ + /c-\Ac + fcfR\Jgcg'n, 

+ fcfn^Jg'cdR- (18) 

We left multiply both sides of Z = (SgUc)l(S^ASR)(V^SR)t by (SgUc)^(S^Uc) 
and right multiply by (V^Sr)(V^Sr)^. We obtain 

(UgScS^Uc)Z(V^SRS^VR) 

= (S^Uc)^(S^Uc)(SgUc)t(S5ASR)(V^SR)t(V^SR)(V^SR)^ 

= (S^Uc)^(S?;ASr)(V^Sr)^ 

= uSScS5(A^ + UcZ*V^)SrS^Vr. 

Here the second equality follows from that Y^YY^ = Y^ and Y^YY^ = Y^ for any Y, 
and the last equality follows by defining A-*- = A — UcZ*V^. It follows that 

(UgScS?:Uc)(Z-Z*)(V^SRS5 VR) = U^ScS^A^SrS^Vr. 
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We decompose A-*- by 

= A - UcUg A + UcUg A - UcUg AVrV^ 

= UcUgA(I„ - VrV^) + (I^ - UcUg)AVRV^ + (I^ - UcUS)A(I„ - VrV^). 

It follows that 

(uSScS?:Uc)(Z - Z*)(V^SrS^Vr) 

= UgScSgUcUg A(I„ - VrV^)SrS^Vr 
+ V^ScS^ilm - UcUS)AVrV^SrS^Vr 
+ VlScSlilm - VcVl)A{ln - VrV^)SrS5Vr, 

and thus 

Z - Z* = UgA(I„ - VrV^)SrSSVr(V^SrS^Vr)-i 

+ (UgScSgUc)”^uSScSg(I„ - UcUg)AVR 

+ {VlScSlVc)-^VlScSlilm - UcUS)A(I - VrV^)SrS^Vr(V^SrSSVr)-^ 

It follows that 

|1Z-Z*||;^ < a-i„(V^SflSSVR)||A(I„-VRV^)SflS^VR||^ 

+ a-l„(USScSgUc)||uSScSS(I„-UcUS)AVR||^ 

+ a-l„(USScSgUc)a-ijV^SflS^VR)||uSScSg(I,„ - UcUS)A(I - VRV^)SflS^VR||^. 
This proves (18) and thereby concludes the proof. ■ 


H.2 Proof of the Theorem 

Assumption 3 assumes that the sketching matrices Sc and Sr satisfy the first two 
approximate matrix multiplication properties. Under the assumption, we obtain Lemma 25, 
which shows that U is nearly as good as U* in terms of objective function value. 

Assumption 3 Let B be any fixed matrix. Let C G and C = be the SVD. 

Assume that a certain sketching matrix Sc G satisfies 

p{||UcScS^Uc-l||2 > ^} < <5i 

pjllugScS^B-UgBllJ, > e||B|||} < <^2 

for any 61,62 G (0, 0.2). Let R G and R = UrXIrV^ be the SVD. Similarly, assume 

Sr G satisfies 

p{||V^SrS^Vr-I||2 > ^} < 6 , 

p{||V^SrS^B-V^B||^ > 6||B||2,} < 52. 
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Lemma 25 Let A G C G and R G be any fixed matrices. Let U* and 

U be defined in ( 8 ) and (9), respectively. Let kc = rank(C), kr = rank(R), q = min{m,n}, 
and e G (0,1) be the error parameter. Assume that the sketching matrices Sc and Sr satisfy 
Assumption 3 and that = o{q). Then 

||A-CUR||| < (1+ 4e2g) ||A-CU*R|||, 
holds with probability at least 1 — 26i — 362 . 


Proof Let fc, /r, he, Lr, gc, gR, g'^, q'r be defined Lemma 24. Under Assumption 3, we 
have that 


fc < y, hc< e||A - < e||A - CU*R^|||, 

fR < y, hR< e||A - AVrV^||| < e||A - CU*R^|||, 

hold simultaneously with probability at least 1 — 25i — 282 . 

We fix a = 1, then ^rc = he, and < 7 ^ < ||(In—V r,V^)SrS]JVr|| 2 . Under Assumption 3, 
we have that 


^Jg'R < ||(In-VRV^)SRSSVR-(R-VRV^)VR||^ 

< V^||(In-VRV^)||^ < 

holds with probability at least 1 — 52 . It follows from Lemma 24 that 

IIA - CURllI - IIA - CU*R||| 

< + /c\Ac + fcfR\JgcgR^j 

< (y V^IIA - CU*R'^||f + ^eV^IIA - CU*R^||c)^ 

= ^e2n(l + o(l))||A-CU*R^||| < 4e2n|| A - CU*R'^|||. 

holds with probability at least 1 — 25i — 3 ^ 2 . Here the equality follows from that e~^ = o{n). 
Alternatively, if we fix a = 0, we will obtain that 

||A-CUR||| < ||A - CU*R||| Ade^mllA - CU*R'^||| 

with probability 1 — 25i — 3 ^ 2 . Therefore, if n < m, we fix a = 1; otherwise we fix a = 0. 
This concludes the proof. ■ 


In the following we further assume that the sketching matrices Sc and Sr satisfy the 
third approximate matrix multiplication property. Under Assumption 3 and Assumption 4, 
we obtain Lemma 26 which is stronger than Lemma 25. 
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Assumption 4 Let B he any fixed matrix. Let C G kc = rank(C), and C = 

UcScVg be the SVD. Assume that a certain sketching matrix G satisfies 

p{||uSScS?:B-UgB||^ > e||B||2 + A||B|||} < ^3 

for any e G (0,1) and 63 G (0,0.2). Let R G kr = rank(R), and R = UrSrV^ be 

the SVD. Similarly, assume that Sr G satisfies 

p{||V^SrSSB-V^B||^ > e||B||2 + A||B|||} < ^3. 

Lemma 26 Let A G C, R, U*, U, kc, kr be defined in Lemma 25. Let q = 

min{m, n} and q = mm{m/kc,n/kr}. Assume that the sketching matrices Sc and Sr 
satisfy Assumption 3 and Assumption 4 o,nd that = o[qj. Then 

||A-CXR||| < (1+ 4e2g) ||A-CX*R||| 

holds with probability at least 1 — 26i — 262 — 63 . 


Proof Let fc, fR, he, hR, gc, gR, g'c, q'r be defined Lemma 24. Under Assumption 3, we 
have shown in the proof of Lemma 25 that 


/c<y, he < e||A - CU*R^||| 
hR<e||A-CU*R'^||| 


hold simultaneously with probability at least 1 — 25i — 262 . 

We fix a = 1, thence = he, Sind gfi < ||(In— VrV^)SrS|(Vr|| 2 . Under Assumption 4, 
we have that 


Qr < ||(In-VRV^)SRS^VK-(I„-VRV^)VR 


< e||R-VRV^||^ + f ||I„-VrV^||^ < e + 


=0 

T l|2 


e(n — kr) en 


kr 


kr 


holds with probability at least 1 — ^ 3 . It follows from Lemma 24 that 

IIA - CURllI - ||A - CU^RllI 

< + feV^ + fcfR\J 9 c9'r^ 

< (y\/^I|A - CU^R^IIe + ^evW^IIA - CU*R^||f)^ 

1 0 ^ 

= -^e2nA:-^(l + o(l))||A-CU*R^||| < 4e2nA;-^||A - CU*R^||^ 

holds with probability at least 1 — 25i — 282 — 82 ,. Here the equality follows from that 
e~^ = oinjkr). 
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Analogously, by fixing a = 0 and assuming e ^ = o{m/kc), we can show that 
||A-CUR|||-||A-CU*R|||, < 4e^mk-^\\A-CV^K^\\l 
holds with probability at least 1 — 25i — 262 — 63 . This concludes the proof. ■ 


Finally, we prove Theorem 9 using Lemma 25 and Lemma 26. 

For leverage score sampling, uniform sampling, and count sketch, Assumption 3 is 
satisfied. Then the bound follows by setting e = 0.5y^ e'/q and applying Lemma 25. Here 
q = min{m, n}. For the three sketching methods, we set = 0.01 and 82 = 0.093. 

For Gaussian projection and SRHT, Assumption 3 and Assumption 4 are satisfied. 
Then the bound follows by setting e = O.by^ e'/q and applying Lemma 26. Here q = 
mm{m/kc, n/kr}■ For Gaussian projection, we set = 0.01, 62 = 0.09, and ^3 = 0.1. For 
SRHT, we set (5i = 0.02, 62 = 0.08, and (ja = 0.1. 
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